{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ph_DM1pCjktz"
      },
      "source": [
        "## Data Ingestion into MongoDB Database\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mongodb-developer/GenAI-Showcase/blob/main/notebooks/rag/chat_with_pdf_mongodb_openai_langchain_POLM_AI_Stack.ipynb)\n",
        "\n",
        "**Steps to creating a MongoDB Database**\n",
        "- [Register for a free MongoDB Atlas Account](https://www.mongodb.com/cloud/atlas/register?utm_campaign=devrel&utm_source=workshop&utm_medium=organic_social&utm_content=rag%20to%20agents%20notebook&utm_term=richmond.alake)\n",
        "- [Create a Cluster](https://www.mongodb.com/docs/guides/atlas/cluster/)\n",
        "- [Get your connection string](https://www.mongodb.com/docs/guides/atlas/connection-string/)\n",
        "\n",
        "## Vector Index Creation\n",
        "\n",
        "- [Create an Atlas Vector Search Index](https://www.mongodb.com/docs/compass/current/indexes/create-vector-search-index/)\n",
        "\n",
        "- If you are following this notebook ensure that you are creating a vector search index for the right database(anthropic_demo) and collection(research)\n",
        "\n",
        "Below is the vector search index definition for this notebook\n",
        "\n",
        "```json\n",
        "{\n",
        "  \"fields\": [\n",
        "    {\n",
        "      \"numDimensions\": 1536,\n",
        "      \"path\": \"embedding\",\n",
        "      \"similarity\": \"cosine\",\n",
        "      \"type\": \"vector\"\n",
        "    }\n",
        "  ]\n",
        "}\n",
        "```\n",
        "\n",
        "- Give your vector search index the name \"vector_index\" if you are following this notebook\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4FNJEHGdj-cc"
      },
      "source": [
        "## Code"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "JFD8rcTYE-EZ",
        "outputId": "85b7fc63-40ea-407e-d97b-d92251e37ea8"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m40.3/40.3 kB\u001b[0m \u001b[31m1.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m25.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25h"
          ]
        }
      ],
      "source": [
        "! pip install --quiet langchain pymongo langchain-openai langchain-community pypdf"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "EZwLZmCB_FrY",
        "outputId": "370ae9b6-4ef1-4ba3-f196-bfc2b5dc8cf3"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Answer: The document is about a significant advance in understanding the inner workings of AI models, specifically focusing on the interpretation of the features inside a large language model called Claude Sonnet. It discusses how millions of concepts are represented within the model, the ability to manipulate these features to see how the model's responses change, and the potential implications for making AI models safer and more trustworthy.\n",
            "Sources:\n",
            "- mapping_llms.pdf: As for the scientific risk, the proof is in the pudding.\n",
            "We successfully extracted millions of featu...\n",
            "- mapping_llms.pdf: A map of the features near an \"Inner Conflict\" feature, including clusters\n",
            "related to balancing trad...\n",
            "- mapping_llms.pdf: Interpret\u0000bility\n",
            "M apping the M ind of a Large\n",
            "Language M odel\n",
            "21 May 2024\n",
            "Today we report a signifi...\n",
            "- mapping_llms.pdf: English word in a dictionary is made by combining letters, and\n",
            "every sentence is made by combining w...\n",
            "- mapping_llms.pdf: answer  I have no physical form, I am an AI model  changed\n",
            "to something much odder: \"I am the Golden...\n"
          ]
        }
      ],
      "source": [
        "import os\n",
        "\n",
        "from google.colab import userdata\n",
        "from langchain.chains import RetrievalQA\n",
        "from langchain.chat_models import ChatOpenAI\n",
        "from langchain.embeddings import OpenAIEmbeddings\n",
        "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
        "from langchain.vectorstores import MongoDBAtlasVectorSearch\n",
        "from langchain_community.document_loaders import PyPDFLoader\n",
        "from pymongo import MongoClient\n",
        "\n",
        "# Set up your OpenAI API key\n",
        "os.environ[\"OPENAI_API_KEY\"] = userdata.get(\"OPENAI_API_KEY\")\n",
        "\n",
        "# Set up MongoDB connection\n",
        "mongo_uri = userdata.get(\"MONGO_URI\")\n",
        "db_name = \"anthropic_demo\"\n",
        "collection_name = \"research\"\n",
        "\n",
        "client = MongoClient(mongo_uri, appname=\"devrel.showcase.chat_with_pdf\")\n",
        "db = client[db_name]\n",
        "collection = db[collection_name]\n",
        "\n",
        "# Set up document loading and splitting\n",
        "loader = PyPDFLoader(\"mapping_llms.pdf\")\n",
        "documents = loader.load()\n",
        "\n",
        "text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)\n",
        "texts = text_splitter.split_documents(documents)\n",
        "\n",
        "# Set up embeddings and vector store\n",
        "embeddings = OpenAIEmbeddings()\n",
        "vector_store = MongoDBAtlasVectorSearch.from_documents(\n",
        "    texts, embeddings, collection=collection, index_name=\"vector_index\"\n",
        ")\n",
        "\n",
        "# Set up retriever and language model\n",
        "retriever = vector_store.as_retriever(search_type=\"similarity\", search_kwargs={\"k\": 5})\n",
        "llm = ChatOpenAI(model_name=\"gpt-3.5-turbo\", temperature=0)\n",
        "\n",
        "# Set up RAG pipeline\n",
        "qa_chain = RetrievalQA.from_chain_type(\n",
        "    llm=llm, chain_type=\"stuff\", retriever=retriever, return_source_documents=True\n",
        ")\n",
        "\n",
        "\n",
        "# Function to process user query\n",
        "def process_query(query):\n",
        "    result = qa_chain({\"query\": query})\n",
        "    return result[\"result\"], result[\"source_documents\"]\n",
        "\n",
        "\n",
        "# Example usage\n",
        "query = \"What is the document about?\"\n",
        "answer, sources = process_query(query)\n",
        "print(f\"Answer: {answer}\")\n",
        "print(\"Sources:\")\n",
        "for doc in sources:\n",
        "    print(f\"- {doc.metadata['source']}: {doc.page_content[:100]}...\")\n",
        "\n",
        "# Don't forget to close the MongoDB connection when done\n",
        "client.close()"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    },
    "widgets": {
      "application/vnd.jupyter.widget-state+json": {
        "state": {}
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
